Discriminatory punishment undermines the enforcement of group cooperation

Peer punishment can help groups to establish collectively beneficial public goods. However, when humans condition punishment on factors other than poor contribution, punishment can become ineffective and group cooperation deteriorates. Here we show that this happens in pluriform groups where members have different socio-demographic characteristics. In our public good provision experiment, participants were confronted with a public good from which all group members benefitted equally, and in between rounds they could punish each other. Groups were uniform (members shared the same academic background) or pluriform (half the members shared the same academic background, and the other half shared another background). We show that punishment effectively enforced cooperation in uniform groups where punishment was conditioned on poor contribution. In pluriform groups, punishment was conditioned on poor contribution too, but also partially on others' socio-demographic characteristics: dissimilar others were punished more than similar others regardless of their contribution. As a result, punishment lost its effectiveness in deterring free-riding and maintaining public good provision. Follow-up experiments indicated that such discriminatory punishment was used to demarcate and reinforce subgroup boundaries. This work reveals that peer punishment fails to enforce cooperation in groups with a pluriform structure, which is the rule rather than the exception in contemporary societies.


Experimental Procedure
Upon arrival in the laboratory, participants were seated in individual cubicles, each containing a personal computer that was used to present the instructions and register their decisions. The experiment began by informing participants that they would engage in a group decision-making task in which they would interact with fellow students from the study programmes Psychology and Pedagogical Science. We assessed the extent to which participants felt affiliated with other psychology and pedagogy students, and students from the Faculty of Social and Behavioural Sciences in general, on a 6-point Likert scale ranging from 1 (completely disagree) to 6 (completely agree) (Figures S1-S3; items adapted from refs. 1,2; αown = 0.86, αother = 0.91, αgeneral = 0.89).
Next, participants received some general instructions about the experiment (Figures S4 and S5).
This was followed by more detailed instructions and comprehension questions about the multi-round public goods game (PGG) they faced in the first block (Figures S6, S7, S9, and S10), which was either without punishment (Figure S8) or with punishment (Figures S18, S19, and S20). After the comprehension questions, the first block started. Each round, participants first made their contribution decision (Figures S11 and S12) and then received feedback about the contribution decisions of each group member (Figure S13). If applicable for this block, participants made their punishment decisions right after (Figure S22) and then received feedback about the punishments that each group member received (Figure S23). Finally, participants received an overview of the round (Figure S14 or S24) before moving to the next round. After 20 rounds, the first block was finished and we assessed participants' beliefs about the frequency of free-riding by the other group members in the first block (Figure S15).
Then, participants proceeded to the next block (Figure S16) and learned that this second block of interactions was with punishment (Figures S17-S21) or without punishment (Figures S17, S8, and S21). Each round, and similar to the first block, participants first made their contribution decision (Figures S11 and S12) and then received feedback about the contribution decisions of each group member (Figure S13). Again, if applicable for this block, participants made their punishment decisions right after (Figure S22) and then received feedback about the punishments each group member received (Figure S23). Finally, participants received an overview of the round (Figure S24 or S14) before moving to the next round. After 20 rounds, the second block was finished and we assessed participants' beliefs about the frequency of free-riding by the other group members in the second block (Figure S15), after which the PGG was finished (Figure S16). Finally, participants completed the social value orientation slider measure (Figure S25; ref. 3), and we asked for their demographics together with questions probing their experience with behavioural experiments (Figure S26).

Figure S1. First assessment of felt affiliation. Example of psychology student.

Participants and Experimental Design
Experiment 2 was conducted in the behavioural laboratory, located in the building of the Faculty of Social and Behavioural Sciences of Leiden University. A total of 276 first-year psychology students (n = 147) and pedagogy students (n = 129) from this university participated (232 women and 44 men; Mage = 19.14, SDage = 2.15 years). The sample size was determined based on feasibility concerns rather than a priori power calculations (see Supplementary Results for a sensitivity analysis). Given the number of first-year students in the study programmes Psychology and Pedagogical Science, and the time available in the laboratory, we aimed to create 52 pluriform groups (requiring 156 psychology students and 156 pedagogy students).
To examine punishment behaviour among both freshmen and relatively more established psychology and pedagogy students, the data were collected both at the start of the first semester and during the second semester of the academic year (we aimed to create 26 groups in each semester). Participants were allowed to take part in the experiment only once, either in the first semester (n = 175) or the second semester (n = 101), and they were randomly assigned to either the give-some treatment (n = 138) or the take-some treatment (n = 138), while keeping the distribution of psychology students and pedagogy students equal across treatments. We initially recruited 278 participants, but later had to exclude 2 participants because their decisions were not recorded correctly due to a technical error.
Throughout the instructions, it was noted several times that the interactions with the other psychology and pedagogy students were not live, but that participants would specify binding decision schemas for the interactions (i.e., we used the so-called strategy method). An advantage of the strategy method is that we collected information about punishment in response to all potential decisions that participants could make, which increased the statistical power of our results and allowed us to observe the complete conditional strategy of participants. After the data of all participants in the experiment were collected, it was randomly determined who interacted with whom, and each participant's outcome was calculated based on their actual decisions and punishment strategies. The total amount of Monetary Units (MU) they earned was converted to euros at the following rate: 10 MU = €0.50. Participants could earn between €0 and €14.25. They earned, on average, €7.75. Two weeks after the experiment, participants could collect their additional payments in cash. In addition to the money, participants also received a personal feedback sheet that provided complete information about how their additional payment was calculated.
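The payment arithmetic can be checked with a minimal sketch. The function name `mu_to_euros` is illustrative and not taken from the experiment software. The best-case scenario is an assumption about how the stated maximum arises: a participant keeps the full 100 MU in S1 while the five other group members give everything, spends none of the 60 MU available in S2, and receives no decrement points. The S1 and S2 parameters are those described under Experimental Procedure.

```python
from fractions import Fraction

# Stated conversion rate for Experiment 2: 10 MU = EUR 0.50
EUROS_PER_MU = Fraction(1, 2) / 10


def mu_to_euros(mu):
    """Convert Monetary Units (MU) to euros at the Experiment 2 rate."""
    return float(Fraction(mu) * EUROS_PER_MU)


# Assumed best case (our reading, not stated in the text):
s1_kept = 100                                   # full endowment kept in S1
s1_share = Fraction(3, 2) * (5 * 100) / 6       # 500 MU x 1.5 / 6 members = 125 MU
s2_unspent = 60                                 # no decrement points assigned
best_case_mu = s1_kept + s1_share + s2_unspent  # 285 MU

print(f"{mu_to_euros(best_case_mu):.2f}")       # prints 14.25
```

Under these assumptions the sketch reproduces the stated upper bound of €14.25.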
Experiment 2 consisted of the following stages: a public goods game stage (S1) and a third-party punishment game stage (S2). At S1, participants faced a linear one-shot PGG, which was presented as either a give-some or a take-some game, depending on the treatment participants were in. Participants performed the PGG in a pluriform group with two students from their own study programme and three students from the other study programme, i.e., a 6-person group with 3 psychology students and 3 pedagogy students. At S2, participants performed a third-party punishment game (TPG) in response to the contribution decisions (in the give-some treatment) or consumption decisions (in the take-some treatment) by members of another 6-person group. That is, as third parties with individual punishment capacity, they oversaw public good provision by another pluriform group.

Experimental Procedure
Upon arrival in the laboratory, participants were seated in individual cubicles, each containing a personal computer that was used to present the instructions and register their decisions. The experiment always began by informing participants that they would engage in a group decision-making task in which they would interact with fellow students from the study programmes Psychology and Pedagogical Science, followed by an assessment of the extent to which they felt affiliated with other students from each of these study programmes (see Materials below).
The instructions explained to participants that the group decision-making task consisted of a stage in which they had to decide to what extent they served their own interest or the interest of a group (S1), and a stage in which they could decrease the outcomes of persons in another group (S2). Specifically, participants learned that in S1 they were part of a 6-person group with students from both the study programmes Psychology and Pedagogical Science.
In the give-some treatment, participants learned that in S1 each person in the 6-person group was endowed with 100 MU and could give between 0 and 100 MU (in steps of 10 MU) to a group account. The MU given to the group account would be multiplied by 1.5 and divided equally among the entire 6-person group, and the MU kept for oneself would be transferred to the participant's private account. We refer to the MU given to the group account as contributions, and to the MU kept for oneself as non-contributions. In the take-some treatment, participants learned that in S1 each person in the 6-person group could take between 0 and 100 MU (in steps of 10 MU) from a group account of 600 MU. The MU taken from the group account would be transferred to the participant's private account, and the MU left in the group account would be multiplied by 1.5 and divided equally among the entire 6-person group. We refer to the MU taken from the group account as consumptions, and to the MU left in the group account as non-consumptions.
Note that across the two treatments, the two versions of the PGG had the same underlying outcome structure and were thus structurally equivalent (ref. 4). In both treatments, the cost of cooperation was higher than the individual return, because each contribution (give-some treatment) or non-consumption (take-some treatment) of 10 MU resulted in a group return of 15 MU (10 × 1.5) and an individual return of 2.5 MU (15 / 6). Therefore, it was always in the material self-interest of any participant to free-ride on the other group members' cooperation by non-contributing/consuming all MU.
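The structural equivalence of the two framings can be illustrated with a minimal sketch. The function names are illustrative and not taken from the experiment software; the parameters (100 MU per person, a multiplier of 1.5, a 6-person group, a 600 MU group account) are those stated above. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction

MULTIPLIER = Fraction(3, 2)   # the group account is multiplied by 1.5
PER_PERSON = 100              # MU endowment (give-some) or maximum take (take-some)


def give_some_payoffs(contributions):
    """S1 earnings when member i gives contributions[i] MU to the group account."""
    n = len(contributions)
    share = MULTIPLIER * sum(contributions) / n
    return [PER_PERSON - c + share for c in contributions]


def take_some_payoffs(consumptions):
    """S1 earnings when member i takes consumptions[i] MU from the group account."""
    n = len(consumptions)
    account = PER_PERSON * n                      # 600 MU for a 6-person group
    share = MULTIPLIER * (account - sum(consumptions)) / n
    return [t + share for t in consumptions]


# Five full cooperators and one free-rider, expressed in both framings.
gives = [100, 100, 100, 100, 100, 0]
takes = [PER_PERSON - g for g in gives]

# Same earnings in both framings: 125 MU per cooperator, 225 MU for the free-rider.
assert give_some_payoffs(gives) == take_some_payoffs(takes)

# Individual return on 10 MU contributed: 10 x 1.5 / 6 = 2.5 MU, below the 10 MU cost.
individual_return = MULTIPLIER * 10 / 6
```

Because the individual return on each 10 MU contributed (2.5 MU) is below its cost, the free-rider in this example earns more than each cooperator, which is the incentive problem described above.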
It was further explained that participants could increase either the joint outcome of their 6-person group by contributing MU to the group account (in the give-some treatment) or non-consuming MU from the group account (in the take-some treatment), or their individual outcome by non-contributing MU to the group account (in the give-some treatment) or consuming MU from the group account (in the take-some treatment). Examples were given of possible scenarios in S1 (e.g., when one group member would free-ride, when none of the group members would cooperate). Following the detailed instructions about S1, the participants received comprehension questions to test their understanding of S1 (comparable to the comprehension questions of Experiment 1), with feedback on the correct answer after each question.
We then repeated that each 6-person group would consist of 3 psychology students and 3 pedagogy students, and emphasized the interdependence between the two subgroups within the larger group. Next, we assessed participants' general trust toward psychology and pedagogy students, and how threatened they felt by psychology and pedagogy students (see Materials below).
Before participants made their contribution/consumption decision in S1, they were first instructed about S2. Participants learned that each group member was endowed with an additional 60 MU, which they could use to assign decrement points (DP) to members of another 6-person group (10 MU per person). For all possible contributions/consumptions in S1, participants could assign between 0 and 10 DP. Each DP reduced the final earnings of the punished target by three MU and cost the punisher one MU. Thus, the self-to-other cost ratio of assigning a DP to someone was 1:3. The MU not used to assign DP would be transferred to the participant's private account. Participants learned that they had to specify their response strategy twice: Once for contributions/consumptions made by psychology students and once for contributions/consumptions made by pedagogy students. Examples were given of possible scenarios in S2 (e.g., when multiple members of the other group would opt for a contribution/consumption for which the participant assigned DP).
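The cost structure of S2 can be illustrated with a minimal sketch. The function name `punishment_outcome` is illustrative; the parameters (60 MU, six targets, at most 10 DP per target, a 1:3 cost ratio) are those stated above.

```python
COST_PER_DP = 1          # MU paid by the punisher for each DP
IMPACT_PER_DP = 3        # MU deducted from the target for each DP
MAX_DP_PER_TARGET = 10
S2_ENDOWMENT = 60        # MU available for punishment
N_TARGETS = 6            # members of the other 6-person group


def punishment_outcome(dp_per_target):
    """Return (MU kept by the punisher, list of MU lost by each target)."""
    if len(dp_per_target) != N_TARGETS:
        raise ValueError("expected one DP value per target")
    if any(not 0 <= dp <= MAX_DP_PER_TARGET for dp in dp_per_target):
        raise ValueError("DP must lie between 0 and 10")
    spent = COST_PER_DP * sum(dp_per_target)
    losses = [IMPACT_PER_DP * dp for dp in dp_per_target]
    return S2_ENDOWMENT - spent, losses


# Assigning the maximum to every target exhausts the 60 MU exactly.
kept, losses = punishment_outcome([MAX_DP_PER_TARGET] * N_TARGETS)
```

The endowment of 60 MU therefore equals the cost of assigning the maximum of 10 DP to each of the six targets, so a participant could never spend more on punishment than the additional endowment.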
While participants were third parties with individual punishment capacity, overseeing the contribution/consumption decisions of members in another pluriform group, yet another pluriform group would oversee the contribution/consumption decisions of their own pluriform group. That is, participants learned that, just as they (group A) could assign DP to members of another 6-person group (group B), members of yet another 6-person group (group C) could assign DP to them and their fellow group members. Thus, participants learned that psychology and pedagogy students in another group could decrease their outcome from S1.
Finally, we reminded the participants that the 6-person groups would be randomly formed after all participants had taken part in the experiment, and that each participant's outcome was calculated based on their actual decisions in S1 and S2. Importantly, there was a closed envelope present in each cubicle, which contained an example of the feedback sheet that participants would receive when collecting their additional payment in cash (Figure S27), and at this stage of the instructions, participants were asked to examine the feedback sheet to get an idea of what information would be provided. Following the detailed instructions about S2, the participants received comprehension questions to test their understanding of the entire experimental procedure (including S1 and S2), with feedback on the correct answer after each question.
After the instructions of S1 and S2, participants first made their contribution/consumption decision (S1) and then specified their response strategies towards the other group (S2). In S1, participants indicated how many MU they contributed to the group account (give-some treatment) or consumed from the group account (take-some treatment) by selecting one of the eleven possible choices (0 to 100 MU, in steps of 10 MU). In S2, the eleven possible choices in S1 were listed and participants indicated for each how many DP they would like to assign if the others would opt for that particular contribution/consumption by typing in a number of DP (0 to 10). After typing in a number, the costs in MU of assigning that number of DP for the participant and the receiver were both shown. Participants specified their assignment of DP once for the 3 psychology students and once for the 3 pedagogy students. To control for sequence effects, whether they first specified their response strategy for psychology or pedagogy students was counterbalanced between participants.
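A response strategy under the strategy method can be thought of as a lookup table from each of the eleven possible S1 choices to a number of DP, specified once per study programme. The sketch below shows how such a table could be resolved once groups are matched. The function name, the data layout, and the example strategy are hypothetical; the text does not describe the software used for matching.

```python
CHOICES = list(range(0, 101, 10))   # the eleven possible S1 choices, in MU


def resolve_strategy(strategy, targets):
    """Look up the DP a third party assigns to each matched target.

    strategy: {programme: {choice_in_MU: DP}}, one table per study programme
    targets:  list of (programme, actual_choice_in_MU) pairs
    """
    return [strategy[programme][choice] for programme, choice in targets]


# Hypothetical give-some strategy: 1 DP for every 10 MU a contribution
# falls below 50 MU, identical for both study programmes.
table = {c: max(0, (50 - c) // 10) for c in CHOICES}
strategy = {"psychology": dict(table), "pedagogy": dict(table)}

# Hypothetical matched group of 3 psychology and 3 pedagogy students.
targets = [("psychology", 0), ("psychology", 30), ("psychology", 100),
           ("pedagogy", 10), ("pedagogy", 20), ("pedagogy", 50)]

assigned_dp = resolve_strategy(strategy, targets)
```

Because every possible choice has an entry, the punishment response is defined regardless of which choices the matched targets actually made, which is what allows the complete conditional strategy to be observed.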
Next, it was explained that there would be a chance that 6-person groups consisting of 3 psychology students and 3 pedagogy students could not be created, and participants were asked whether and how they would want to change their response strategies if the composition would be either 4 psychology students and 2 pedagogy students (i.e., majority of psychology students) or the other way around (i.e., majority of pedagogy students). Participants were shown the response strategies they had specified before and could change them for each of the two alternative compositions (order counterbalanced between participants). Finally, we assessed participants' general positive and negative perceptions of psychology and pedagogy students (see Materials below). We also included an assessment of social value orientation, but due to a technical error we had to drop this measure. At the end of the experiment, participants were thoroughly debriefed, were given instructions about how to collect their additional payments, and were thanked for their participation.

Materials
To assess the extent to which participants felt affiliated with other students from the study programmes Psychology and Pedagogical Science, they rated the applicability of four statements on a 7-point Likert scale ranging from 1 (disagree) to 7 (agree), twice: Once about psychology students and once about pedagogy students ("I identify with psychology/pedagogy students," "I feel connected to psychology/pedagogy students," "I feel involved with psychology/pedagogy students," and "I see myself as belonging to the group of psychology/pedagogy students;" adapted from refs. 1,2; αown = 0.85, αother = 0.84).
To assess the extent to which participants generally trusted other students from the study programmes Psychology and Pedagogical Science, they rated the applicability of eight statements on a 7-point Likert scale ranging from 1 (disagree) to 7 (agree), twice: Once about psychology students and once about pedagogy students ("I believe that psychology/pedagogy students tend to keep/take many MU for themselves," "I believe that psychology/pedagogy students tend to think about their self-interest," "I believe that psychology/pedagogy students tend to put self-interest above group interest," "I believe that psychology/pedagogy students tend to give/leave few MU to/in the group account," "I believe that psychology/pedagogy students can be trusted to put their self-interest aside," "I believe that psychology/pedagogy students can be trusted to think about the interest of the group," "I believe that psychology/pedagogy students can be trusted to do something good for the group," "I believe that psychology/pedagogy students can be trusted to contribute many MU to the group account/consume few MU from the group account;" adapted from refs. 5,6; αown = 0.92, αother = 0.91).
To assess the extent to which participants felt threatened by other students from the study programmes Psychology and Pedagogical Science, they rated the applicability of two statements on a 7-point Likert scale ranging from 1 (disagree) to 7 (agree), twice: Once about psychology students and once about pedagogy students ("When I think about psychology/pedagogy students giving few MU to the group account/taking many MU from the group account, I feel threatened," "When I think about psychology/pedagogy students giving few MU to the group account/taking many MU from the group account, I feel attacked;" adapted from ref. 1; αown = 0.86, αother = 0.85).
To assess participants' general positive perceptions of other students from the study programmes Psychology and Pedagogical Science, they rated the applicability of four statements on a 7-point Likert scale ranging from 1 (disagree) to 7 (agree), twice: Once about psychology students and once about pedagogy students ("I generally find psychology/pedagogy students generous," "I generally find psychology/pedagogy students helpful," "I generally find psychology/pedagogy students bounteous," "I generally find psychology/pedagogy students social;" αown = 0.78, αother = 0.79).
To assess participants' general negative perceptions of other students from the study programmes Psychology and Pedagogical Science, they rated the applicability of four statements on a 7-point Likert scale ranging from 1 (disagree) to 7 (agree), twice: Once about psychology students and once about pedagogy students ("I generally find psychology/pedagogy students greedy," "I generally find psychology/pedagogy students covetous," "I generally find psychology/pedagogy students stingy," "I generally find psychology/pedagogy students selfish;" αown = 0.90, αother = 0.90).
Figure S27. Example of the feedback sheet.

Participants and Experimental Design
[...] for a sensitivity analysis). Given the number of first-year students in the study programmes Psychology and Pedagogical Science, and the time available in the laboratory, we aimed to create 32 pluriform groups (requiring 96 psychology students and 96 pedagogy students). The data were collected in the first semester of the academic year. Participants were randomly assigned to either the give-some treatment (n = 89) or the take-some treatment (n = 90), while keeping the distribution of psychology students and pedagogy students equal across treatments.
The research approach was similar to Experiment 2. We again used the strategy method and randomly determined who interacted with whom after the data of all participants in the experiment were collected. The total amount of MU participants earned was converted to euros at the following rate: 10 MU = €0.25. Participants could earn between €0 and €14.25. They earned, on average, €7.81. Two weeks after the experiment, participants could collect their additional payments in cash. In addition to the money, and similar to Experiment 2, participants also received a personal feedback sheet that provided complete information about how their additional payment was calculated.
Experiment 3 consisted of the following stages: a public goods game stage (S1) and a third-party punishment game stage (S2). At S1, participants faced two linear one-shot PGGs, which were presented as either give-some or take-some games, depending on the treatment participants were in. First, participants performed a PGG in a uniform group with two students from their own study programme, i.e., a 3-person group with either psychology or pedagogy students. Second, participants performed a PGG in a pluriform group with two students from their own study programme and three students from the other study programme, i.e., a 6-person group with 3 psychology students and 3 pedagogy students. At S2, participants performed a TPG in response to the contribution decisions (in the give-some treatment) or consumption decisions (in the take-some treatment) by members of two other 3-person groups (one with psychology students and one with pedagogy students) and members of one other 6-person group. That is, as third parties with individual punishment capacity, they oversaw public good provision by two other uniform groups and one other pluriform group.

Experimental Procedure
The experimental procedure was similar to the procedure of Experiment 2, except for the number of PGGs and TPGs that participants faced. The instructions explained to participants that the group decision-making task consisted of a stage in which they had to decide to what extent they served their own interest or the interest of the two groups (S1), and a stage in which they could decrease the outcomes of persons in other groups (S2). Specifically, participants learned that in S1 they would be a member of two different groups: A 3-person group with students from their own study programme (either psychology or pedagogy students), and a 6-person group with other students from both the study programmes Psychology and Pedagogical Science.
In the give-some treatment, participants learned that each person would be endowed with 100 MU and could contribute between 0 and 100 MU (in steps of 10 MU) to the group account of their 3-person group. The MU contributed to this group account would be multiplied by 1.5 and divided equally among the entire 3-person group, and the MU not contributed to this group account would be transferred to the participant's private account. In addition, participants learned that each person would be endowed with another 100 MU and could contribute between 0 and 100 MU (in steps of 10 MU) to a group account of their 6-person group. The MU contributed to this group account would be multiplied by 1.5 and divided equally among the entire 6-person group, and the MU not contributed to this group account would be transferred to the participant's private account.
In the take-some treatment, participants learned that each person could consume between 0 and 100 MU (in steps of 10 MU) from the group account of 300 MU of their 3-person group. The MU consumed from this group account would be transferred to the participant's private account, and the MU not consumed from this group account would be multiplied by 1.5 and divided equally among the entire 3-person group. In addition, participants learned that each person could also consume between 0 and 100 MU (in steps of 10 MU) from a group account of 600 MU of their 6-person group. The MU consumed from this group account would be transferred to the participant's private account, and the MU not consumed from this group account would be multiplied by 1.5 and divided equally among the entire 6-person group.
Note that the two contribution/consumption decisions were presented as independent decisions, involving different group accounts and different group members. Note also that across the two treatments, the two versions of the two PGGs had the same underlying outcome structures and were thus structurally equivalent (ref. 4). However, because the group size differed across the two PGGs that participants faced (i.e., a 3-person versus a 6-person group), their underlying outcome structures were comparable but not exactly the same.
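One way to see how the two outcome structures differ is to compare the individual return on a contribution across group sizes. The sketch below is our own illustration using the stated multiplier of 1.5; the function name is not taken from the source.

```python
from fractions import Fraction

MULTIPLIER = Fraction(3, 2)   # each group account is multiplied by 1.5


def individual_return(contribution, group_size):
    """MU returned to the contributor from their own contribution."""
    return MULTIPLIER * contribution / group_size


return_3 = individual_return(10, 3)   # 3-person uniform group: 5 MU
return_6 = individual_return(10, 6)   # 6-person pluriform group: 2.5 MU

# In both groups the return stays below the 10 MU cost of contributing.
assert return_3 < 10 and return_6 < 10
```

With the same multiplier, a contribution of 10 MU returns 5 MU to the contributor in the 3-person group but only 2.5 MU in the 6-person group. Free-riding maximizes individual earnings in both cases, but the individual cost of cooperating is larger in the 6-person group.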
Similar to Experiment 2, participants were first instructed about S2 before they made their contribution/consumption decisions in S1. Participants learned that each group member was endowed with an additional 120 MU and could use these MU to assign decrement points (DP) to members of three other groups (10 MU per person). More specifically, it was explained that they had to do this for members of (i) a 3-person group with psychology students, (ii) a 3-person group with pedagogy students, and (iii) a 6-person group with 3 psychology students and 3 pedagogy students. For all possible contributions/consumptions in S1, participants could assign between 0 and 10 DP to each member of the other group if they would opt for that particular contribution/consumption. Each DP reduced the final earnings of the punished target by three MU and cost the punisher one MU. Thus, the self-to-other cost ratio of assigning a DP to someone was 1:3. The MU not used to assign DP would be transferred to the participant's private account. Participants learned that they had to specify four response strategies: Once for contributions/consumptions by psychology students in the 3-person group, once for contributions/consumptions by pedagogy students in the 3-person group, once for contributions/consumptions by psychology students in the 6-person group, and once for contributions/consumptions by pedagogy students in the 6-person group.
While participants were third parties with individual punishment capacity, overseeing the contribution/consumption decisions of members in two other uniform groups and one other pluriform group, other uniform and pluriform groups would oversee the contribution/consumption decisions of their own uniform and pluriform groups. That is, participants learned that, just as they (group A) could assign DP to members of two other 3-person groups (groups B), members of two other 3-person groups (groups C) could assign DP to them and their fellow group members. Thus, participants learned that psychology and pedagogy students in other 3- and 6-person groups could decrease their outcome from S1.
Similar to Experiment 2, participants were asked to examine the feedback sheet that they would receive when collecting their additional payment in cash (Figure S28). Also similar to Experiment 2, participants first made their contribution/consumption decisions (S1) and then specified their response strategies (S2). In S1, participants always indicated first how many MU they contributed to the group account (give-some treatment) or consumed from the group account (take-some treatment) by selecting one of the eleven possible choices (0 to 100 MU, in steps of 10 MU). Participants always indicated their contribution/consumption first for the 3-person group and then for the 6-person group. In S2, the eleven possible choices in S1 were listed and participants indicated for each how many DP they assigned if the others opted for that particular contribution/consumption by typing in a number of DP (0 to 10). After typing in a number, the costs in MU of assigning that number of DP for the participant and the receiver were both shown. Although participants always indicated their assignment of DP first for the 3-person groups and then for the 6-person group, whether they first specified their response strategy for psychology or pedagogy students was counterbalanced between participants.

Materials
We used the same measures as in Experiment 2 (see Supplementary Methods) to assess the extent to which participants (a) felt affiliated with other psychology and pedagogy students

Statistical Procedures
Here, we describe the statistical modelling strategy for the results reported in the main manuscript. The data of our three experiments were hierarchically structured, because each observation was nested in participants and, in Experiment 1, groups. To account for the dependency of observations, we fitted mixed-effects regression models using the lme4 package in R. To derive p-values, we applied Satterthwaite's method (ref. 7), and we used a two-sided p-threshold of 5% to determine significance in all models.

Experiment 1
To analyse the total group contribution and the total group wealth, we specified separate linear mixed-effects regression models (fitted by maximum likelihood), with a random effect for groups. To analyse free-riding, the frequency of receiving punishments, and the frequency of punishment, we specified separate generalized linear mixed-effects logistic regression models (fitted by maximum likelihood using the Laplace approximation; ref. 8), with two random-effect intercepts for groups and participants. To analyse the costs of receiving punishments and the expenditure on punishment, we specified separate generalized linear mixed-effects Poisson (logit) regression models (fitted by maximum likelihood using the Laplace approximation; ref. 8), with two random-effect intercepts for participants and groups. In all these models, we included fixed-effect predictors for round and block order to control for their effects.

Experiments 2 and 3
To analyse the frequency of third-party punishment, we specified generalized linear mixed-effects logistic regression models (fitted by maximum likelihood using the Laplace approximation; ref. 8), with a random-effect intercept for participants. To analyse the expenditure on third-party punishment, we specified generalized linear mixed-effects Poisson (logit) regression models (fitted by maximum likelihood using the Laplace approximation; ref. 8), with a random-effect intercept for participants.
For all possible choices in the PGG (by dissimilar others versus similar others), participants made a punishment decision. To determine whether a specific contribution (give-some treatment) or consumption (take-some treatment) can be considered an act of free-riding or cooperation, we took participants' own contribution (consumption) in the PGG as the reference point and coded comparatively lower contributions (higher consumptions) by others as free-riding, and contributions equal to or above (consumptions equal to or below) this point as cooperation (for a similar procedure, see ref. 9). For example, if a participant in the give-some treatment contributed 60 MU in the PGG, we coded a contribution of 0 to 50 MU by the target as free-riding and a contribution of 60 to 100 MU as cooperation. Likewise, if a participant in the take-some treatment consumed 40 MU in the PGG, we coded a consumption of 50 to 100 MU by the target as free-riding and a consumption of 0 to 40 MU as cooperation. To control for the effects of the different contribution levels (consumption levels) regardless of whether these were coded as free-riding or cooperation, we first reverse-coded the different consumption levels in the take-some treatment (to match them with the different contribution levels in the give-some treatment) and then included a fixed-effect predictor for the target's possible contributions/non-consumptions in all our models.
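The coding rule described above can be sketched as follows. The function names and the string labels are illustrative and not taken from the analysis scripts; reverse-coding is implemented here as 100 MU minus the consumption, which is our reading of "non-consumptions".

```python
def code_target_choice(own_choice, target_choice, treatment):
    """Code a target's choice relative to the participant's own S1 choice."""
    if treatment == "give-some":
        # Contributing less than the participant counts as free-riding.
        is_free_riding = target_choice < own_choice
    elif treatment == "take-some":
        # Consuming more than the participant counts as free-riding.
        is_free_riding = target_choice > own_choice
    else:
        raise ValueError(f"unknown treatment: {treatment!r}")
    return "free-riding" if is_free_riding else "cooperation"


def to_contribution_scale(choice, treatment, maximum=100):
    """Reverse-code consumptions so both treatments share one 0-100 scale."""
    return maximum - choice if treatment == "take-some" else choice


# The two worked examples from the text.
give_example = [code_target_choice(60, c, "give-some") for c in (50, 60)]
take_example = [code_target_choice(40, c, "take-some") for c in (40, 50)]
```

In the first example a contribution of 50 MU is coded as free-riding and 60 MU as cooperation; in the second a consumption of 40 MU is coded as cooperation and 50 MU as free-riding. After reverse-coding, a consumption of 40 MU corresponds to a non-consumption of 60 MU.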
In addition, we also included fixed-effect predictors in all our models that coded whether participants either made decisions in the give-some or take-some treatment, decided about punishing dissimilar and similar others in different orders, and/or were either freshmen or relatively more established students (only in the models for Experiment 2).
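The coding rule described above can be illustrated with a minimal Python sketch. It assumes the 0 to 100 MU scale used in the examples, so that reverse-coding a consumption amounts to subtracting it from 100 MU; the function names `reverse_code` and `code_target` are hypothetical and not part of the original analysis scripts.

```python
MAX_MU = 100  # upper end of the decision scale, as in the examples above


def reverse_code(consumption: int) -> int:
    """Map a take-some consumption onto the give-some contribution scale."""
    return MAX_MU - consumption


def code_target(own: int, target: int, treatment: str) -> str:
    """Classify a target's choice relative to the participant's own choice.

    `own` and `target` are contributions in the give-some treatment and
    consumptions in the take-some treatment.
    """
    if treatment == "take-some":
        # Work on the common contribution (non-consumption) scale.
        own, target = reverse_code(own), reverse_code(target)
    elif treatment != "give-some":
        raise ValueError(f"unknown treatment: {treatment}")
    # Contributing less than the participant counts as free-riding;
    # contributing the same amount or more counts as cooperation.
    return "free-riding" if target < own else "cooperation"


# The two worked examples from the text.
print(code_target(60, 50, "give-some"))   # target gave less than 60 MU
print(code_target(40, 50, "take-some"))   # target took more than 40 MU
```

Reverse-coding first means a single comparison serves both treatments, which mirrors how the consumption-levels were matched to the contribution-levels before entering the models.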

Supplementary Results
For each experiment, we first provide the full models underlying the results reported in the main manuscript and then the additional and/or exploratory analyses that were not the main focus of this research. Finally, we provide sensitivity power analyses for our three experiments.

Total group contribution and wealth
The total contributions of groups were higher with than without punishment in the uniform groups, but significantly less so in the pluriform groups (Table S1, column 1; punishment × group structure interaction coefficient and punishment coefficient). In a similar vein, the total earnings of groups were higher with than without punishment in the uniform groups, but significantly less so in the pluriform groups (Table S1, column 2; punishment × group structure interaction coefficient and punishment coefficient).

Free-riding
On the individual-level, free-riding (i.e., when a participant was endowed but did not contribute) was less frequent with than without punishment in the uniform groups, but significantly less so in the pluriform groups (Table S2; punishment × group structure interaction coefficient and punishment coefficient).

Frequency and costs of receiving punishments
The differential effects of punishment across the uniform and pluriform groups reported above cannot be explained by the overall frequency and costs of receiving punishments. Participants received punishments from others as frequently in pluriform groups as in uniform groups (Table S3, column 1; group structure coefficient), and the average costs of receiving punishments were also the same (Table S3, column 2; group structure coefficient).

Frequency of and expenditure on punishment across uniform and pluriform groups
Participants punished as frequently in pluriform groups as in uniform groups (Table S4, column 1; group structure coefficient), and mainly directed their punishments at non-contributors rather than contributors (Table S4, column 1; target contributed coefficient). However, the difference in punishment of non-contributors and contributors was overall smaller in the pluriform compared to the uniform groups ( Likewise, participants incurred similar costs to punish in the pluriform groups as in the uniform groups (Table S5, column 1; group structure coefficient), and they incurred more costs to punish non-contributors than to punish contributors (Table S5, column 1; target contributed coefficient). The difference in the incurred costs to punish non-contributors and contributors, however, was overall smaller in the pluriform than in the uniform groups (Table S5, column 2; group structure × target contributed interaction coefficient; Table S5, columns 3 & 4; target contributed coefficients). Participants incurred more costs to punish when they themselves had contributed in the current round (Table S5, column 1; source contributed coefficient), and when they had received punishment themselves in the previous round (Table S5, column 1; punishment received t-1 coefficient).

Frequency of and expenditure on punishment in pluriform groups
Participants punished dissimilar others more frequently than similar others (Table S6, column 1; target's subgroup coefficient), and such discriminatory punishment was unaffected by whether someone had contributed or not (

Discriminatory punishers
To see how many participants punished dissimilar others more than the similar other, and thus engaged in discriminatory punishment, we calculated a difference score for each participant of These difference scores capture discriminatory punishment (i.e., positive value = they punished the dissimilar others more than the similar other; negative value = they punished the similar other more than the dissimilar others; zero = they punished the similar and dissimilar others equally) and thus allow us to identify whether or not participants, on average, were discriminatory punishers. Figure S29 shows that the majority of participants were indeed discriminatory punishers (62.5% in terms of frequency and 68.1% in terms of expenditure), some were more punitive towards similar others, but most towards dissimilar others.

Felt affiliation
For each participant, the experiment always started with an assessment of their felt affiliation with other psychology and pedagogy students, and students from the Faculty of Social and Behavioural Sciences in general. This allowed us to test whether participants felt more affiliated with similar others (i.e., students from their own study programme; e.g., psychology students) than dissimilar others (i.e., students from the other study programme; e.g., pedagogy students) and others from the overarching group in general (i.e., students from the Faculty of Social and Behavioural Sciences in general). We specified a linear mixed-effects regression model (fitted by maximum likelihood), with a random-effect intercept for participants, and two fixed-effect Next, we calculated a difference score for each participant, capturing their relative affiliation with similar over dissimilar others (i.e., positive value = they felt more affiliated with similar others than with dissimilar others; negative value = they felt less affiliated with similar others than with dissimilar others), and we explored whether this difference in felt affiliation was associated with discriminatory punishment in the pluriform groups. That is, we added the difference score (mean centred) and its interaction with target's subgroup as fixed-effect predictors to the initial models we ran on frequency of punishment and expenditure on punishment in the pluriform groups (for the initial models, see Table S6, columns 1 & 3). These new models both yielded a significant difference score × target's subgroup interaction coefficient (frequency: b ± se = 0.19 ± 0.09, P = 0.026; expenditure: b ± se = 0.10 ± 0.04, P = 0.010). This indicates that participants who felt more affiliated with similar rather than dissimilar others also exhibited more discriminatory punishment, both in terms of frequency of punishment and expenditure on punishment.

Beliefs
After each block, we assessed participants' beliefs about the frequency of free-riding by the other group members in that specific block. This allowed us to test to what extent participants perceived their group members as free-riders, depending on the availability of punishment (absent versus present), the structure of the group (uniform versus pluriform), and the others' subgroup (similar versus dissimilar).
First, for each participant, we calculated the average expected percentage of free-riding in the block with punishment and in the block without punishment. We specified linear mixed-effects regression models (fitted by maximum likelihood), with a random-effect intercept for participants. In the first model, we included two fixed-effect predictors for punishment (0 = absent; 1 = present) and group structure (0 = uniform; 1 = pluriform), as well as a fixed-effect predictor for block order (0 = without punishment first; 1 = with punishment first) to control for its effects. In the second model, we also included a fixed-effect predictor for the punishment × group structure interaction. These models yielded a significant punishment coefficient, indicating that participants believed that their group members were free-riding less frequently with punishment (M% = 36.78, SE% = 1.33) than without punishment (M% = 41.80, SE% = 1.33), b ± se = -5.02 ± 1.88, P = 0.008. The group structure coefficient (b ± se = 4.73 ± 4.45, P = 0.290) and the punishment × group structure interaction coefficient (b ± se = 4.76 ± 3.73, P = 0.204) were both non-significant.
Second, for each participant in the pluriform group, we calculated the average expected percentage of free-riding by the one similar other and two dissimilar others in their pluriform group. We specified linear mixed-effects regression models (fitted by maximum likelihood), with a random-effect intercept for participants. In the first model, we included two fixed-effect predictors for punishment (0 = absent; 1 = present) and target's subgroup (0 = similar; 1 = dissimilar), as well as a fixed-effect predictor for block order (0 = without punishment first; 1 = with punishment first) to control for its effects. In the second model, we also included a fixed-effect predictor for the punishment × target's subgroup interaction. The coefficients of punishment (b ± se = -2.40 ± 2.08, P = 0.250), target's subgroup (b ± se = 1.00 ± 2.08, P = 0.630), and the punishment × target's subgroup interaction (b ± se = 1.46 ± 4.15, P = 0.726) were all non-significant.
Combined, these analyses suggest that our introduction of a pluriform group structure did not impact participants' beliefs about the frequency of free-riding by others in their group. Thus, although we observed discriminatory punishment in the pluriform groups, such subgroup-based discrimination may not be rooted in different beliefs about group members.

Social value orientation
For each participant, the experiment ended with an assessment of their social value orientation (SVO), which allowed us to check whether social preferences were comparable across uniform and pluriform groups. Figure S30 shows, for each group, the average deviation of group members' SVO score from the pre-determined boundary between the categories prosocial and individualistic in the SVO task (SVO score = 22.45) 3 . As can be seen, the majority of the groups were, on average, prosocially rather than individualistically oriented. More importantly, SVO scores were similar in uniform and pluriform groups. A linear regression model showed that  Next, we explored whether discriminatory punishment emerged even when controlling for SVO, and whether SVO was associated with the emergence of discriminatory punishment.
Therefore, we first added the SVO score (mean centred) and, secondly, also its interaction with target's subgroup as fixed-effect predictors to the initial models we ran on frequency of punishment and expenditure on punishment in the pluriform groups (for the initial models, see Table S6,

Frequency of and expenditure on TP punishment
Like in Experiment 1, we again found that participants mainly directed their punishments at free-riders rather than cooperators (Table S7, column 1; target a free-rider coefficient), and incurred more costs to punish these free-riders (Table S7, column 3; target a free-rider coefficient). Moreover, participants' own contribution level was associated with both the frequency of punishment (Table S7, column 1; source's contribution coefficient) and the expenditure on punishment (Table S7, column 3; source's contribution coefficient), indicating that high contributors punished more than low contributors (note that the consumptions in the take-some treatment were reverse-coded; see explanation below under Contribution).
Crucially, and complementing Experiment 1, participants punished dissimilar others more frequently than similar others (Table S7, column 1; target's subgroup coefficient), irrespective of whether the target was free-riding or cooperating (Table S7, column 2; target's subgroup × target a free-rider interaction coefficient). Likewise, participants incurred more costs to punish dissimilar others than similar others (Table S7, column 3; target's subgroup coefficient), irrespective of whether the target was free-riding or cooperating (Table S7,

Discriminatory punishers
Like in Experiment 1, we again calculated a difference score for each participant of both their average frequency of punishment and their average expenditure on punishment to identify how many participants engaged in discriminatory punishment. In Experiment 2, participants specified punishment strategies (rather than punishing across rounds as was the case in Experiment 1), and we subtracted the average frequency (expenditure) with which each participant punished similar others from the average frequency (expenditure) with which they punished dissimilar others across all possible contributions. Hence, this difference score also captures discriminatory punishment (i.e., positive value = they punished dissimilar others more than similar others; negative value = they punished similar others more than dissimilar others; zero = they punished the similar and dissimilar others equally) and again allows us to identify whether or not participants, on average, were discriminatory punishers. Figure S31 shows that, in contrast to Experiment 1, the majority of participants punished dissimilar others equally to similar others and, thus, were not discriminatory punishers. Of the participants who were discriminatory punishers (23.6% in terms of frequency and 34.4% in terms of expenditure), most of them directed this towards dissimilar rather than similar others.
Whereas the difference scores in Experiment 1 were calculated based on participants' average punishment across rounds in the repeated interaction, the difference scores in Experiment 2 were calculated based on participants' average punishment across all possible contributions in the one-shot interaction. This difference, together with the fact that participants were third parties overseeing the public good provision of another pluriform group without being subject to noise about others' intentions, may explain the difference in results between Experiments 1 and 2.
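To make the difference score used for Experiment 2 concrete, the following minimal Python sketch computes it for one participant from their punishment strategy. The input lists and the names `difference_score` and `classify` are hypothetical illustrations; the example values are invented and are not data from the experiments.

```python
from statistics import mean


def difference_score(to_dissimilar, to_similar):
    """Average punishment of dissimilar others minus that of similar others.

    Each list holds one entry per possible contribution of the target:
    0/1 for frequency of punishment, or MU spent for expenditure.
    """
    return mean(to_dissimilar) - mean(to_similar)


def classify(score, tol=1e-9):
    """Label a participant by the sign of their difference score."""
    if score > tol:
        return "punished dissimilar others more"
    if score < -tol:
        return "punished similar others more"
    return "punished similar and dissimilar others equally"


# Invented strategy: punish (1) or not (0) for five possible contributions.
to_dissimilar = [1, 1, 1, 0, 0]
to_similar = [1, 1, 0, 0, 0]

score = difference_score(to_dissimilar, to_similar)
print(round(score, 2), classify(score))
```

The small tolerance in `classify` only guards against floating-point noise when two averages are equal; it is a design choice of this sketch, not a threshold taken from the analyses.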

Change in punishment strategy for alternative group compositions
Participants were asked whether and how they wanted to change their punishment strategies if the composition would not be 3 psychology and 3 pedagogy students, but either 4 psychology students and 2 pedagogy students (i.e., majority of psychology students) or vice versa (i.e., majority of pedagogy students). This allowed us to see whether the observed patterns of discriminatory punishment would change when dissimilar others would either become a majority or minority in the pluriform group. To analyse participants' punishment strategies across the three group compositions (i.e., equal, dissimilar majority, dissimilar minority), we extended the initial models we ran on frequency of punishment and expenditure on punishment by including fixed-effect contrasts for dissimilar majority (= 1; equal = 0) and dissimilar minority (= 1; equal = 0), as well as their interactions with target's subgroup.
Interestingly, these additional models showed that when dissimilar others would become a majority, the difference in the expenditure on punishment between dissimilar others and similar others became larger (dissimilar majority × target's subgroup interaction; b ± se = 0.06 ± 0.03, P = 0.038), but not the difference in the frequency of punishment (dissimilar majority × target's subgroup interaction; b ± se = 0.16 ± 0.15, P = 0.298). When dissimilar others would become a minority, by contrast, both the difference in frequency of punishment (dissimilar minority × target's subgroup interaction; b ± se = -0.06 ± 0.15, P = 0.673) and expenditure on punishment (dissimilar minority × target's subgroup interaction; b ± se = 0.02 ± 0.03, P = 0.614) remained the same. Irrespective of group composition, dissimilar others were punished more than similar others, both in terms of frequency of punishment (b ± se = 0.44 ± 0.06, P ≤ 0.001) and expenditure on punishment (b ± se = 0.10 ± 0.01, P ≤ 0.001).

Contribution
Before participants specified their punishment strategies, they had first made a contribution decision (in the give-some treatment) or consumption decision (in the take-some treatment) themselves. To include this contribution/consumption decision as predictor in the above models, we reverse-coded the different consumption-levels in the take-some treatment to match them with the different contribution-levels in the give-some treatment. For example, the consumption of 40 MU equalled a contribution of 60 MU and was, therefore, reverse-coded to a non-consumption of 60 MU. We collapsed these decisions across treatments and refer to them as contributions in the results. Figure S32 shows the frequency of contributions. Participants, on average, contributed 55.36 MU (SD = 29.95) in the PGG.
Figure S32. The frequency of contributions.

Felt affiliation
As in Experiment 1, Experiment 2 always started with an assessment of participants' felt affiliation with other psychology and pedagogy students. We specified a linear mixed-effects Next, we calculated a difference score for each participant, capturing their felt affiliation with similar others over dissimilar others (i.e., positive value = they felt more affiliated with similar others than with dissimilar others; negative value = they felt less affiliated with similar others than with dissimilar others), and we added this difference score (mean centred) and its interaction with target's subgroup as fixed-effect predictors to the initial models we ran on frequency of punishment and expenditure on punishment in the pluriform groups (for the initial models, see Table S7, columns 1 & 3). As in Experiment 1, the model for expenditure on punishment yielded a significant difference score × target's subgroup interaction coefficient (b ± se = 0.06 ± 0.02, P ≤ 0.001), which was not the case for frequency of punishment (b ± se = 0.13 ± 0.08, P = 0.110). This indicates that participants displayed more discriminatory punishment (in terms of the costs they incurred to punish) the more affiliated they felt with similar others rather than dissimilar others.
In contrast to our first experiment, participants in this second experiment were third parties overseeing the one-shot public good provision of another pluriform group without being subject to noise about others' intentions. One or more of these differences in experimental design may explain why the difference in felt affiliation between dissimilar and similar others was positively associated with the degree of discriminatory punishment in terms of incurred costs but not in terms of frequency.

General trust, felt threat, and perceptions
Throughout the experiment, we assessed participants' perceptions of other students in the study programmes Psychology and Pedagogical Science. More specifically, after participants were instructed about the PGG they faced, we assessed their general trust that other psychology and pedagogy students would serve the collective interest, and how threatened the involvement of other psychology and pedagogy students made them feel. Moreover, at the end of the experiment, we assessed some general positive and negative perceptions of other psychology and pedagogy students. These measures allowed us to assess whether participants had differential perceptions about similar and dissimilar others.

Frequency of and expenditure on TP punishment
Regardless of whether others were in a uniform or pluriform group, participants mainly directed their punishments at free-riders rather than cooperators (Table S8, column 1; target a free-rider coefficient), and incurred more costs to punish these free-riders (Table S9, column 1; target a free-rider coefficient). Moreover, participants' own contribution level was associated with the expenditure on punishment of others in the uniform and pluriform groups (Table S9, column 3; source's contribution coefficient), but not with the frequency with which participants punished others in the uniform and pluriform groups (Table S8, column 1; source's contribution coefficient), which indicates that high contributors incurred more costs to punish others in the uniform and pluriform groups than low contributors (as in Experiment 2, consumptions in the take-some treatment were reverse-coded).
Participants punished dissimilar others more frequently than similar others (Table S8, column 1; target's subgroup coefficient), and they incurred more costs to punish them (
This table shows the results from the models estimating the frequency of punishment (0 = no, 1 = yes) as a function of target's subgroup (column 1), and as a function of target's subgroup × target a free-rider (column 2), when only including the uniform groups (column 3), and when only including the pluriform groups (column 4). SEs shown in parentheses. *** P ≤ 0.001, ** P ≤ 0.01, * P ≤ 0.05.

Discriminatory punishers
Similar to Experiments 1 and 2, we again calculated a difference score for each participant of both their average frequency of punishment and their average expenditure on punishment to identify how many participants engaged in discriminatory punishment. However, in Experiment 3, participants specified punishment strategies for uniform and pluriform groups.
Separately for the uniform groups and the pluriform group, we therefore subtracted the average frequency (expenditure) with which each participant punished similar others from the average frequency (expenditure) with which they punished dissimilar others across all possible contributions. Hence, these difference scores both capture discriminatory punishment (i.e., positive value = they punished dissimilar others more than similar others; negative value = they punished similar others more than dissimilar others; zero = they punished the similar and dissimilar others equally) and again allow us to identify whether or not participants, on average, were discriminatory punishers and whether this differed between the uniform groups and the pluriform group.
Complementing Experiment 2, Figure S33 shows that the majority of participants punished dissimilar others equally to similar others in both uniform and pluriform groups and, thus, never were discriminatory punishers. Of the participants who were discriminatory punishers towards dissimilar others (31.8% in terms of frequency and 44.1% in terms of expenditure), most of them either were so in both the uniform and pluriform groups or, more importantly, only in the pluriform group.

Contributions
Before participants specified their punishment strategies, they had first made contribution decisions (in the give-some treatment) or consumption decisions (in the take-some treatment) themselves. Similar to Experiment 2, we reverse-coded the consumption-levels in the take-some treatment to match them with the contribution-levels in the give-some treatment, and collapsed participants' decisions across treatments. We refer to these collapsed decisions as contributions in the results. Figure S34 shows the frequency of contributions in the uniform groups and the pluriform groups. Although the underlying outcome structure of the PGG in the uniform group was not exactly the same as in the pluriform group (due to the difference in group size), we tested for differences in contributions across the two groups. We specified a linear mixed-effects regression model (fitted by maximum likelihood), with a random-effect intercept for participants, and a fixed-effect predictor for group structure (0 = uniform group; 1 = pluriform group). This model showed that participants contributed more to the group account  Next, we again calculated a difference score for each participant, capturing their felt affiliation with similar over dissimilar others (i.e., positive value = they felt more affiliated with similar than dissimilar others; negative value = they felt less affiliated with similar than dissimilar others). We added this difference score (mean centred) and its interaction with target's subgroup and group structure as fixed-effect predictors to the initial models we ran on frequency of punishment and expenditure on punishment (for the initial models, see Tables S8 & S9, columns 2, 3 and 4). This model for expenditure yielded a significant difference score × target's subgroup × group structure interaction coefficient (b ± se = -0.05 ± 0.03, P = 0.048), which was not the case for frequency of punishment (b ± se = -0.10 ± 0.14, P = 0.477).
Thus, complementing Experiment 2, participants displayed more discriminatory punishment (in terms of the costs they incurred to punish) in pluriform rather than uniform groups, the more affiliated they felt with similar others rather than dissimilar others.

General trust, felt threat, and perceptions
Throughout the experiment, we assessed participants' perceptions of other students in the study programmes Psychology and Pedagogical Science to assess whether participants had differential perceptions about similar and dissimilar others. For each measure (i.e., general trust, felt threat, positive perceptions, and negative perceptions), we specified a linear mixed-effects

Sensitivity Analyses
In our experiments, sample size was determined based on feasibility concerns rather than a priori power calculations (see Supplementary Methods). We conducted sensitivity power analyses to determine the minimum effect size that could be detected with a power of .80 in our mixed-effects regression models of the key dependent variables. For these estimated models, we substituted the coefficient of interest with a range of coefficients, and on each of these coefficients, we conducted 500 simulated power analyses using the simr package in R 10 . In each simulation, new values for the response variable were simulated using the specified model, the model (with the substituted coefficient) was then refitted to the simulated response, and a statistical test was applied to the simulated fit. Power was calculated from the number of positive and negative runs.
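The logic of such a simulation-based sensitivity analysis can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the simr procedure itself: it replaces the mixed-effects models with an ordinary least-squares regression on a single binary predictor and uses a normal-approximation Wald test. The sample size, residual standard deviation, and candidate coefficients are hypothetical placeholders, and the function names are invented for this sketch.

```python
import random
from statistics import NormalDist, mean


def fit_slope(x, y):
    """Ordinary least-squares slope and its standard error for y = a + b*x."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5
    return b, se


def simulated_power(b, x, sigma, n_sim=500, alpha=0.05, seed=1):
    """Share of simulated data sets in which the substituted coefficient b
    is detected by a two-sided Wald test (normal approximation)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_sim):
        # Simulate a new response from the model with the substituted b,
        # then refit and test the refitted coefficient.
        y = [b * xi + rng.gauss(0, sigma) for xi in x]
        est, se = fit_slope(x, y)
        if abs(est / se) > z_crit:
            hits += 1
    return hits / n_sim


# Hypothetical design: 40 observations per condition, residual SD of 8.
x = [0] * 40 + [1] * 40
for b in (-3, -4, -5, -6, -7):
    print(b, simulated_power(b, x, sigma=8.0))
```

Scanning the printed values for the smallest coefficient whose power reaches .80 gives the minimum detectable effect under the assumed design; in the actual analyses this role is played by the estimated mixed-effects models.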

Experiment 1
First, we took the model estimating total group contribution (Table S1, column 1) and substituted the punishment × group structure interaction coefficient with coefficients ranging from b = -3 through b = -7. Second, we also took the model estimating frequency of punishment in the pluriform groups (Table S6,

Experiments 2 and 3
For Experiment 2, we took the model estimating frequency of punishment (Table S7, (Table S8,